Inputs are combined into a weighted sum, adjusted by a bias, and then passed through an activation to produce the output.
The same pattern is repeated across many neurons to form layers.
graph LR
x1([x1]) -->|w1| S
x2([x2]) -->|w2| S
x3([x3]) -->|w3| S
S["$$b + \sum_{k=1}^K w_k x_k$$"] --> A["$$\phi(\cdot)$$"]
A --> y([output])
The summation rule in practice
Inputs aligned with positive weights push the neuron’s output upward, while inputs with negative weights pull it downward.
The bias sets the level the neuron tends to output when inputs are near zero.
Together the weights and bias define a decision boundary in the space of inputs.
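The rule above can be sketched in a few lines of Python. The weights, bias, and inputs here are illustrative values, not taken from the text, and sigmoid is used as one possible choice of activation:

```python
import math

def neuron(inputs, weights, bias):
    """Weighted sum of inputs plus bias, passed through a sigmoid activation."""
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))  # sigmoid as an example phi

# Inputs aligned with positive weights push the output up;
# inputs with negative weights pull it down.
print(neuron([1.0, 2.0, 3.0], [0.5, -0.25, 0.1], bias=0.0))
```

Because the sigmoid squashes the weighted sum into (0, 1), the output can be read as a graded response rather than a hard threshold.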
Activation functions \(\phi\) at a glance
The activation function makes the function modelled by the network non-linear.
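A few common choices of \(\phi\), sketched in Python. The selection is illustrative; the text does not prescribe a particular activation:

```python
import math

def sigmoid(s):
    # Squashes any real input into (0, 1).
    return 1.0 / (1.0 + math.exp(-s))

def tanh(s):
    # Squashes into (-1, 1), centred on zero.
    return math.tanh(s)

def relu(s):
    # Passes positive inputs through unchanged, zeroes out negatives.
    return max(0.0, s)
```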
Why the activation matters for learning
Without activation functions, a stack of layers would collapse to a single linear transformation that cannot model curves.
Non-linear activations allow networks to fit bends, thresholds, and interactions that are essential in real data.
Appropriate activations also help gradients flow during learning so the network can be trained effectively.
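The collapse of stacked linear layers can be checked directly: composing two linear maps yields another linear map. A small sketch with illustrative 2-by-2 weights:

```python
def linear(W, b, x):
    # y_i = b_i + sum_j W[i][j] * x[j]
    return [bi + sum(wij * xj for wij, xj in zip(row, x))
            for row, bi in zip(W, b)]

W1, b1 = [[1.0, 2.0], [0.0, 1.0]], [0.5, -0.5]
W2, b2 = [[1.0, -1.0], [2.0, 0.0]], [0.0, 1.0]

x = [3.0, 4.0]
two_layers = linear(W2, b2, linear(W1, b1, x))

# The same result from a single linear map: weights W2 @ W1, bias W2 @ b1 + b2.
W = [[sum(W2[i][k] * W1[k][j] for k in range(2)) for j in range(2)]
     for i in range(2)]
b = linear(W2, b2, b1)
one_layer = linear(W, b, x)
print(two_layers, one_layer)  # identical
```

Inserting a non-linear \(\phi\) between the two layers breaks this equivalence, which is exactly why depth only pays off with non-linear activations.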
From neuron to network
A layer contains many neurons that look at the same inputs in parallel.
The outputs of one layer become the inputs to the next layer, allowing the network to build up more complex representations.
flowchart LR
subgraph Input
x1((x1))
x2((x2))
end
subgraph Hidden
h1((h1))
h2((h2))
end
subgraph Output
y((y))
end
x1 --> h1
x1 --> h2
x2 --> h1
x2 --> h2
h1 --> y
h2 --> y
From neuron to network (maths)
Assuming one “hidden” layer of \(K\) neurons and \(J\) inputs, the network output is: \[
y = \phi\left(b_y + \sum_{k=1}^{K} v_k h_k\right),
\] where each hidden neuron output \(h_k\) is \[
h_k = \phi\left(b_k + \sum_{j=1}^{J} w_{jk} x_j\right).
\]
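These two equations transcribe almost directly into code. A sketch in Python with \(J = 2\) inputs, \(K = 2\) hidden neurons, tanh as \(\phi\), and illustrative weights:

```python
import math

def forward(x, W, b_hidden, v, b_y, phi=math.tanh):
    # Hidden layer: h_k = phi(b_k + sum_j w_jk * x_j)
    h = [phi(bk + sum(W[j][k] * x[j] for j in range(len(x))))
         for k, bk in enumerate(b_hidden)]
    # Output: y = phi(b_y + sum_k v_k * h_k)
    return phi(b_y + sum(vk * hk for vk, hk in zip(v, h)))

# Illustrative parameters for J = 2 inputs and K = 2 hidden neurons.
W = [[0.5, -1.0],   # weights from x1 to h1, h2
     [1.0,  0.3]]   # weights from x2 to h1, h2
y = forward([1.0, 2.0], W, b_hidden=[0.0, 0.1], v=[1.0, -1.0], b_y=0.0)
```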
Deep learning vs artificial neural networks
Artificial neural networks (ANNs) are the general family of models built from interconnected computational “neurons” that map inputs to outputs.
Deep learning refers to training ANNs with multiple stacked (deep) layers and the practical ecosystem — large data, specialised architectures (CNNs, RNNs, transformers), optimisation techniques, and hardware — that enables hierarchical representation learning.
In short: deep learning ⊂ ANNs — deep learning denotes deep, large-scale ANNs plus the methods and infrastructure that make them effective.
Single-layer era: perceptrons
In 1958 Frank Rosenblatt introduced the perceptron as a simple neuron model that could learn linear decision boundaries from labeled examples.
Early experiments on image recognition created excitement because the perceptron demonstrated that machines could adapt their internal parameters from data.
Researchers soon recognized that single-layer perceptrons could not represent functions like exclusive-or, which hinted at the need for multilayer architectures.
The “dark ages” after Minsky and Papert
In 1969 Marvin Minsky and Seymour Papert published a rigorous critique showing fundamental limits of perceptrons, which discouraged funding and interest.
Through the 1970s and early 1980s, symbolic approaches dominated artificial intelligence while connectionist methods received limited attention.
Small pockets of research persisted, but progress slowed because training deeper networks remained computationally and algorithmically difficult.
Backpropagation and the return of learning in multilayer nets
In 1986 David Rumelhart, Geoffrey Hinton, and Ronald Williams popularized error backpropagation, which made it practical to train multilayer feedforward networks.
Yann LeCun and colleagues demonstrated convolutional neural networks with gradient descent for handwritten digit recognition, linking neural networks to computer vision.
Despite these advances, compute, data, and hardware constraints limited performance on large real-world problems for another two decades.
Deep learning renaissance
Around 2006 Geoffrey Hinton, Ruslan Salakhutdinov, Yoshua Bengio, and others showed that layer-wise pretraining could initialize deep networks effectively.
In 2012 Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton won the ImageNet Large Scale Visual Recognition Challenge with AlexNet, powered by GPUs and ReLU activations.
Rapid progress followed across speech, vision, and language as larger datasets, better regularization, and improved architectures unlocked new capabilities.
Attention and the Transformer
In 2017 Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, and colleagues introduced the Transformer architecture with the principle “attention is all you need.”
Transformers replaced recurrent computation with self-attention, enabling efficient parallel training and superior performance on sequence modeling tasks.
Subsequent models such as BERT, GPT, and ViT scaled data and parameters, establishing Transformers as a general-purpose foundation across modalities.
The current golden age
Large-scale pretraining, transfer learning, and instruction tuning have produced versatile models that support research, industry, and education at unprecedented scale.
Advances in optimization, tooling, and hardware have turned neural networks into an engineering discipline with reproducible pipelines and strong empirical baselines.
Ongoing debates about safety, evaluation, and societal impact now accompany technical progress, as researchers balance innovation with responsible deployment.
How learning works
First, the network starts with random weights and biases.
Then it computes outputs for a batch of inputs and measures error with a loss function that compares outputs to targets.
Next, gradients indicate how each weight should change to reduce the loss, and the parameters are nudged in that direction.
This cycle repeats over many passes until performance on held-out data stops improving.
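The cycle above can be sketched for a single linear neuron (identity activation, for simplicity) with squared-error loss and analytic gradients. The toy data, learning rate, and iteration count are illustrative:

```python
import random

random.seed(0)
# Toy data: learn y = 2*x + 1 with a single linear neuron.
data = [(x, 2.0 * x + 1.0) for x in [-1.0, 0.0, 1.0, 2.0]]

w, b = random.random(), random.random()  # random initial parameters
lr = 0.1                                 # step size for each nudge

for _ in range(500):                     # repeat the loop over many passes
    for x, target in data:
        pred = w * x + b                 # forward pass: compute output
        err = pred - target              # gradient of 0.5*err**2 w.r.t. pred
        w -= lr * err * x                # nudge parameters against the gradient
        b -= lr * err
```

After training, `w` and `b` settle near the true values 2 and 1; a real network differs only in having many more parameters and gradients computed by backpropagation.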
Diagram of the learning loop (backpropagation)
The network produces predictions from inputs, the loss measures the gap to targets, gradients are computed, and the weights are updated.
Repeating this loop gradually improves performance if the task is learnable and the data are informative.
flowchart LR
A[inputs] --> B[network]
B --> C[predictions]
D[targets] --> E[loss]
C --> E
E --> F[compute gradients]
F --> G[update weights]
G --> B
Overfitting in simple terms
A network can memorise training examples instead of learning general patterns that transfer to new data.
Performance on validation data helps detect this because accuracy on unseen data can decline while training accuracy keeps rising.
Simpler architectures, regularisation techniques, and more diverse data reduce overfitting.
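One common way to act on the validation signal is early stopping: halt training once the validation loss has stopped improving, and keep the parameters from the best epoch. A self-contained sketch (the loss sequence and patience value are illustrative):

```python
def early_stopping(val_losses, patience=3):
    """Return the epoch with the best validation loss, scanning until the
    loss has failed to improve for `patience` consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Validation loss falls, then rises as the network starts to memorise.
print(early_stopping([0.9, 0.7, 0.6, 0.65, 0.7, 0.8]))  # → 2
```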
What to remember
A network is a flexible function built from many simple units arranged in layers.
Each unit forms a weighted sum of its inputs, shifts the result with a bias, and applies a non-linear activation.
Learning is an iterative process that reduces a loss on examples by adjusting the weights and biases with gradient information.
Depth, non-linearity, and sufficient data provide the power seen in modern applications.
Further reading
Bishop, C. M., & Bishop, H. (2023). Deep learning: Foundations and concepts. Springer. (SpringerLink)
Prince, S. J. D. (2023). Understanding deep learning. MIT Press. (MIT Press)